164 research outputs found

    A case for code-representative microbenchmarks

    Microbenchmarks are fundamental in the design of a microarchitecture. They allow rapid evaluation of the system while incurring little exploration overhead. One key design aspect is the thermal design point (TDP), the maximum sustained power that a system will experience in typical conditions. Designers tend to use hand-coded microbenchmarks to estimate TDP. In this work we make the case for a systematic methodology to automatically generate code-representative microbenchmarks that can be used to drive the TDP estimation.

    Improving scalability of task-based programs

    In a multi-core era, parallel programming allows further performance improvements, but with an important programmability cost. We envision that the best approach to parallel programming that can overcome the programmability, parallelism, power, memory and reliability walls in Computer Architecture is a run-time approach. Many traditional computer architecture concepts can be revisited and applied at the runtime layer in a way that is completely transparent to the programmer. The goal of this work is to bring the computer architecture concepts of value prediction and data prefetching into a runtime environment like OmpSs.

    Explaining dynamic cache partitioning speed ups

    Cache partitioning has been proposed as an interesting alternative to traditional eviction policies of shared cache levels in modern CMP architectures: throughput is improved at the expense of a reasonable cost. However, these new policies present different behaviors depending on the applications that are running in the architecture. In this paper, we introduce some metrics that characterize applications and allow us to give a clear and simple model to explain final throughput speed ups. Peer Reviewed. Postprint (published version).

    Design trade-offs for emerging HPC processors based on mobile market technology

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: http://dx.doi.org/10.1007/s11227-019-02819-4
    High-performance computing (HPC) is at the crossroads of a potential transition toward mobile market processor technology. Unlike in prior transitions, numerous hardware vendors and integrators will have access to state-of-the-art processor designs due to Arm’s licensing business model. This fact gives them greater flexibility to implement custom HPC-specific designs. In this paper, we undertake a study to quantify the different energy-performance trade-offs when architecting a processor based on mobile market technology. Through detailed simulations over a representative set of benchmarks, our results show that: (i) a modest amount of last-level cache per core is sufficient, leading to significant power and area savings; (ii) in-order cores offer favorable trade-offs when compared to out-of-order cores for a wide range of benchmarks; and (iii) heterogeneous configurations help to improve processor performance and energy efficiency. Peer Reviewed. Postprint (author's final draft).

    TaskPoint: sampled simulation of task-based programs

    Sampled simulation is a mature technique for reducing simulation time of single-threaded programs, but it is not directly applicable to simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1 at an average error of 1.8% and a maximum error of 15.0%. This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EU FP7 (contract 2013BP B 00243). T. Grass has been partially supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058). Peer Reviewed. Postprint (author's final draft).

    POSTER: SPiDRE: accelerating sparse memory access patterns

    Development in process technology has led to an exponential increase in processor speed and memory capacity. However, memory latencies have not improved as dramatically and represent a well-known problem in computer architecture. Cache memories provide more bandwidth with lower latencies than main memories but they are capacity limited. Locality-friendly applications benefit from a large and deep cache hierarchy. Nevertheless, this is a limited solution for applications suffering from sparse and irregular memory access patterns, such as data analytics. In order to accelerate them, we should maximize usable bandwidth, reduce latency and maximize moved data reuse. In this work we explore the Sparse Data Rearrange Engine (SPiDRE), a novel hardware approach to accelerate these applications through near-memory data reorganization. This work has been supported by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P, Ramon y Cajal fellowship number RYC-2016-21104 and FPI fellowship number BES-2017-080635), and by the Arm-BSC Centre of Excellence initiative. Peer Reviewed. Postprint (author's final draft).

    Power efficient job scheduling by predicting the impact of processor manufacturing variability

    Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that do not consider this variability suffer performance degradation and waste power. To avoid this negative impact, users and system administrators must actively counteract manufacturing variability. In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that, compared to contemporary scheduling policies used on production clusters, they decrease job turnaround time by up to 31% while saving up to 5.5% energy. Postprint (author's final draft).

    Adapting cache partitioning algorithms to pseudo-LRU replacement policies

    Recent studies have shown that cache partitioning is an efficient technique to improve throughput, fairness and Quality of Service (QoS) in CMP processors. The cache partitioning algorithms proposed so far assume Least Recently Used (LRU) as the underlying replacement policy. However, it has been shown that true LRU imposes extraordinary complexity and area overheads when implemented on high-associativity caches, such as last-level caches. As a consequence, current processors available on the market use pseudo-LRU replacement policies, which provide behavior similar to LRU while reducing the hardware complexity. Thus, the LRU-based cache partitioning solutions presented so far cannot be applied to real CMP architectures. This paper proposes a complete partitioning system for caches using the pseudo-LRU replacement policy. In particular, the paper focuses on the pseudo-LRU implementations proposed by Sun Microsystems and IBM, called Not Recently Used (NRU) and Binary Tree (BT), respectively. We propose a high-accuracy profiling logic and a cache partitioning hardware for both schemes. We evaluate our proposals' hardware costs in terms of area and power, and compare them against the LRU partitioning algorithm. Overall, this paper presents two hardware techniques to adapt the existing cache partitioning algorithms to real replacement policies. The results show that our solutions impose negligible performance degradation with respect to LRU. Peer Reviewed. Postprint (published version).
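
As a rough illustration of the NRU idea mentioned in the abstract above, the following is a minimal sketch of one cache set with a single "used" bit per way. It is a simplified textbook-style variant, not the Sun Microsystems implementation or the paper's partitioning hardware; the class name `NRUSet` and its methods are hypothetical, and the victim choice (first way with a clear bit) is an assumption.

```python
class NRUSet:
    """One cache set with a simplified Not Recently Used (NRU) policy."""

    def __init__(self, ways):
        if ways < 2:
            raise ValueError("this sketch assumes at least two ways")
        self.tags = [None] * ways    # tag held by each way (None = empty)
        self.used = [False] * ways   # one "used" bit per way

    def _touch(self, way):
        # Mark the way as recently used.
        self.used[way] = True
        # If every bit is now set, clear all bits except the one just touched,
        # so that at least one way is always a replacement candidate.
        if all(self.used):
            self.used = [False] * len(self.used)
            self.used[way] = True

    def access(self, tag):
        """Return True on a hit; on a miss, install the tag and return False."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        # Miss: evict the first way whose used bit is clear (assumed tie-break).
        victim = self.used.index(False)
        self.tags[victim] = tag
        self._touch(victim)
        return False
```

Compared with true LRU, which must track a full recency order per set, this keeps only one bit per way, which is the kind of hardware saving the abstract refers to.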

    A vulnerability factor for ECC-protected memory

    Fault injection studies and vulnerability analyses have been used to estimate the reliability of data structures in memory. We survey these metrics and look at their adequacy to describe the data stored in ECC-protected memory. We also introduce FEA, a new metric improving on the memory derating factor by ignoring a class of false errors. We measure all metrics using simulations and compare them to the outcomes of injecting errors in real runs. This in-depth study reveals that FEA provides more accurate results than any state-of-the-art vulnerability metric. Furthermore, FEA gives an upper bound on the failure probability due to an error in memory, making this metric a tool of choice to quantify memory vulnerability. Finally, we show that ignoring these false errors reduces the failure rate on average by 12.75% and up to over 45%. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328), by the Spanish Government (Severo Ochoa grant SEV-2015-0493) and by the European Union’s Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). L. Jaulmes has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2013/06982. M. Moreto and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowships RYC-2016-21104 and RYC-2017-23269. Peer Reviewed. Postprint (author's final draft).

    DReAM: Per-task DRAM energy metering in multicore systems

    Interaction across applications in DRAM memory impacts its energy consumption. This paper makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations, such as per-task energy-aware task scheduling and energy-aware billing in datacenters. In particular, the contributions of this paper are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate, yet low cost, implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is more accurate than these other methods. This work has been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2012-34557, the HiPEAC Network of Excellence, by the European Research Council under the European Union’s 7th FP, ERC Grant Agreement n. 321253, and by a joint study agreement between IBM and BSC (number W1361154). Qixiao Liu has also been funded by the Chinese Scholarship Council under grant 2010608015. Postprint (published version).
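
The two standard methods that the abstract above compares against can be sketched in a few lines. This is only an illustration of the baselines as they are commonly understood (splitting a measured total evenly, or in proportion to each task's access count); it is not DReAM itself, and the function names and the joule unit are assumptions made for the example.

```python
def split_evenly(total_energy_j, task_ids):
    """Even distribution: every task is charged the same share of DRAM energy."""
    share = total_energy_j / len(task_ids)
    return {task: share for task in task_ids}


def split_by_access_count(total_energy_j, accesses):
    """Access-count based: charge each task in proportion to its DRAM accesses.

    `accesses` maps a task id to the number of memory accesses it issued.
    """
    total_accesses = sum(accesses.values())
    if total_accesses == 0:
        # No accesses recorded: fall back to an even split (assumed behavior).
        return split_evenly(total_energy_j, list(accesses))
    return {task: total_energy_j * n / total_accesses
            for task, n in accesses.items()}
```

For example, with 10 J measured and access counts of 3 and 1, the access-count method charges 7.5 J and 2.5 J, while the even split charges 5 J each. Neither baseline accounts for how tasks interfere with one another in DRAM, which is the interaction the abstract identifies as affecting energy consumption.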